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Abstract 

Missing data is a recurrent issue in epidemiology where the infection process may be 
partially observed. Approximate Bayesian Computation, an alternative to data imputation 
methods such as Markov Chain Monte Carlo integration, is proposed for making inference 
in epidemiological models. It is a likelihood-frce method that relies exclusively on numerical 
simulations. ABC consists in computing a distance between simulated and observed summary 
statistics and weighting the simulations according to this distance. We propose an original 
extension of ABC to path- valued summary statistics, corresponding to the cumulated number 
of detections as a function of time. For a standard compartmcntal model with Succptiblc, 
Infectious and Recovered individuals (SIR), we show that the posterior distributions obtained 
with ABC and MCMC are similar. In a refined SIR model well-suited to the HIV contact- 
tracing data in Cuba, we perform a comparison between ABC with full and binned detection 
times. For the Cuban data, we evaluate the efficiency of the detection system and predict the 
evolution of the HIV-AIDS disease. In particular, the percentage of undetected infectious 
individuals is found to be of the order of 40%. 
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simulation-based inference; likelihood-free inference. 
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1 Introduction 

Mathematical modelling plays an important role for understanding and predicting the spread of 
diseases, as well as for comparing and evaluating public health policies. Although deterministic 
modelling can be a guide for describing epidemics, stochastic models have their importance in 
featuring realistic processes and in quantifying confidence in parameters estimates and prediction 
uncertainty [JJ. Standard models in epidemiology consist in compartmental models in which the 
population is structured in different classes composed of Susceptible, Infectious, and Recovered 
(or Removed) individuals (SIR). Parameter estimation for SIR models is usually a difficult task 
because of missing observations, which is a recurrent issue in epidemiology. When the infected 
population is partially observed or when the infection times are missing, the computation of the 
likelihood is numerically infeasible because it involves integration over all the unobserved events. 

Markov Chain Monte Carlo (MCMC) methods that treat the missing data as extra pa- 
rameters, have become increasingly popular for calibrating stochastic epidemiological models 
with missing data [2H l2"tJj [8]. However, MCMC may be computationally prohibitive for high- 
dimensional missing observations [HJ [23] and fine tuning of the proposal distribution is required 
for efficient algorithms [T7j- In this paper, we show that SIR models with missing observations 
can be calibrated with Approximate Bayesian Computation (ABC), an alternative to MCMC, 
originally proposed for making inference in population genetics [3]. This approach is not based on 
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the likelihood function but relies on numerical simulations and comparisons between simulated 
and observed summary statistics. 

This work is motivated by the study of the Cuban HIV-AIDS database that contains the 
dates of detection of the 8,662 individuals that have been found to be HIV positive in Cuba 
between 1986 and 2007 [11]. The database contains additional covariates including the manner 
by which an individual has been found to be HIV positive. The individuals can be detected 
either by random screening (individuals 'spontaneously' take a detection test) or contact-tracing. 
Contact-tracing consists in testing the sexual contacts of detected individuals [TS]. The total 
number of infectious individuals as well as the infection times are unknown. 

In Section [21 we introduce the stochastic SIR model with contact-tracing developed by [10] . 
Section [3] is devoted to ABC methods when all detection times are known. We propose an 
original extension of ABC to path- valued summary statistics consisting of the cumulated number 
of detections through time. For a simple SIR model, we compare numerically the posterior 
distributions obtained with ABC and MCMC. Section 4 deals with possibly noisy or binned 
detection times for which the previous path-valued statistics are unavailable. We introduce a 
finite dimensional vector of summary statistics and compare the statistical properties of point 
estimates and credibility intervals obtained with full and binned detection times. Finally, Section 
[5] concentrates on the analysis of the Cuban HIV-AIDS database. We address several questions 
concerning the dynamic of this epidemic: what is the percentage of the epidemic that is known 
[18 } I12 |, rTT] : how many new cases of HIV are expected in the forthcoming years; and what is the 
proportion of detections that is expected in the contact-tracing program. 

2 A stochastic SIR model for HIV-AIDS epidemics with contact- 
tracing 

We restrict our study to the sexually-transmitted epidemic of HIV in Cuba (90% of the epidemic, 
see |18j). For modelling the dynamics of the number of known and unknown HIV cases, we 
consider a SIR model developed by [ID] . The population is divided into three main classes 
S, / and R corresponding to the susceptible, infectious, and detected individuals considered 
as removed because we assume that they do not transmit HIV anymore (see Figure [1]). The 
population of the susceptible individuals, of size St, at time t > 0, consists of the sexually active 
seronegative individuals. Individuals immigrate into the class S with a rate Ao and leave it by 
dying/emigrating, with rate HoSt, or by becoming infected. The class of infectious individuals, 
of size It, corresponds to the seropositive individuals who have not taken a detection test yet 
and may thus contaminate new susceptible individuals. We assume that each infected individual 
may transmit the disease to a susceptible individual at rate Ai so that the total rate of infection 
is equal to XiStlt- Individuals leave the class / when they die/emigrate with a total rate of ^\It, 
or when they are detected to be HIV positive. 

The class R of the detected individuals, of size Rt, is subdivided into two subclasses. We 
denote by R\ (resp. R% ) the size of the removed population detected by random screening 
(resp. contact-tracing). The total rate of detection by random screening is \2h- For the rate of 
contact-tracing detection, the model shall capture the fact that the contribution of a removed 
individual depends on the time elapsed since she/he has been found to be HIV positive. We 
consider the two following expressions for the total rate of contact-tracing detection 



where T, denotes the time at which a removed individual i has been detected. The weight 
exp(— c(t — Ti)), with c > 0, determines the contribution of a removed individual i to the 




and 
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contact-tracing according to the time t — T, she/he has been detected. The first rate in (|2.1[) 
corresponds to a mass action principle, and is proportional to the sum of the contributions 
of the detected individuals. The second rate in (|2.ip corresponds to a model with frequency 
dependence. In the following, 6 = (p\, Ai, A2, A3, c) denotes the multivariate parameter of the 
model. The parameters of interest are here Ai, A2 and A3. 

2.1 Connection between the stochastic and the deterministic SIR model 

Here we consider the first rate of contact-tracing detection in (|2.ip . but similar results can be 
obtained for the second rate. [10] showed that in a large population limit, the SIR process of 
Section [5] can be modelled with stochastic differential equations and converges to the solution 
of the following system of ordinary differential equations 



where st and it denote the size of the susceptible and infectious populations, and where r t is the 
contribution of the removed individuals to the contact-tracing (see Section 1 of the supplemen- 
tary material for more details). 

Apart from the inherent stochastic nature of epidemic propagation that may be particularly 
important for small populations [2], considering a stochastic SIR model rather than its deter- 
ministic counterpart can present at least two important advantages. First, it is quite straightfor- 
ward to perform simulations from the stochastic model (see Section 2 of the supplementary ma- 
terial) and this is one motivation for considering ABC methods. Second, the individual-centered 
stochastic process suits the formalism of statistical methods, which are based on samples of 
individual data. Since the estimates of the stochastic process converge to the parameters of the 
ODEs [10], ABC provides a new alternative for calibrating parameters of ODEs [HE]- 

3 ABC with sufficient summary statistics for epidemic models 
3.1 Main principles of ABC 

For simplicity, we deal here with densities and not general probability measures. Let x be the 
available data and tt(9) be the prior. Two approximations are at the core of ABC. 

Replacing observations with summary statistics Instead of focusing on the posterior 
density p(6\x), ABC aims at a possibly less informative target density p(9\S(x) = s a b s ) oc 
Pr(s o b s \0)ir(8) where S is a summary statistic that takes its values in a normed space, and s b s 
denotes the observed summary statistic. The summary statistic S can be a d-dimensional vector 
or an infinite-dimensional variable such as a L 1 function. Of course, if S is sufficient, then the 
two conditional densities are the same. The target distribution will also be coined as the partial 
posterior distribution. 

Simulation-based approximations of the posterior Once the summary statistics have 
been chosen, the second approximation arises when estimating the partial posterior density 
p(0 I S(x) = s bs) an d sampling from this distribution. This step involves nonparametric kernel 
estimation and possibly correction refinements given in Section 14.21 




(2.2) 
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3.2 Sampling from the posterior 



The ABC method with smooth rejection generates random draws from the target distribution 
as follows [3] 

1. Generate N random draws (0i,Si), i = 1, . . . , N. The parameter 9i is generated from the 
prior distribution ir and the vector of summary statistics Sj is calculated for the i th data 
set that is simulated from the generative model with parameter 0j. 

2. Associate to the i th simulation the weight Wi = K$(si — s obs ), where 5 is a tolerance 
threshold and Kg a (possibly multivariate) smoothing kernel. 

3. The distribution {^2^ = iWi5g i )/('^2^ 1 Wi), in which 6g denotes the Dirac mass at 9, ap- 
proximates the target distribution. 



3.3 Point estimation and credibility intervals 

Once a sample from the target distribution has been obtained, several estimators may be consid- 
ered for point estimation of each one-dimensional parameter A,-, j = 1, 2, 3. Using the weighted 
sample Wi), i = 1, . . . , N, the mean of the target distribution p(\j\s b s ) is estimated by 



YliLl ^j,i W i _ Si=l ^j,i K s(Sj ~ S ob s ) 
Eill Wi Y£=l K s( S i - s obs) 



\ _ Z-ri=l ^3,i YV i _ Z^i=l /x J,«- fi o^« *obs) - _ i o q (? i \ 

Aj — — jv — ~ ~' 3 — l i 1 ' 6 \ 6 - 1 ) 



which is the well-known Nadaraya- Watson regression estimator of the conditional expectation 
E(Aj | s ofes)- We also compute the medians, modes, and 95% credibility intervals (CI) of the 
marginal posterior distribution (see Section 3 of the supplementary material). 



3.4 Data and choice of summary statistics 

Data Starting at the time of the first detection in 1986, the Cuban HIV-AIDS data consist 
principally of the detection times at which the individuals have been found to be HIV positive. 
At the time of the last detection event, in July 2007, there is a total of 8,662 individuals in the 
database. For each detection event, there is a label indicating if the individual has been detected 
by random screening or contact-tracing. 

Summary statistics We consider the two (infinite-dimensional) statistics (Rf,t € [0, T]) and 
(Rt,t G [0, T]). Their sum is equal to the cumulated number of detections since the beginning 
of the epidemic. Because the data consist of the detection times for the two different types of 
detection, these two statistics can simply be viewed as a particular coding of the whole dataset so 
that the partial posterior distribution p{9 \ R , i? 2 ) is equal to the posterior distribution p{9 \ x). 
The L 1 -norm between the i th simulated path R\ and the observed one R l obs is 

i-t 

\\R l obs - R\\\i = \R l obs ,s-4,s\ds , 1 = 1,2, i = l,...,N. (3.2) 
J o 

For computing the weights W%, we choose a product kernel so that Wi = Ks 1 (\\Rl bs —Rj ||i)-K"<5 2 (ll-^obs - 
-Rf ||i) w here 5±, 62 are 2 tolerance thresholds. Epanechnikov kernels are considered for K$ 1 and 
K$ 2 and 5\ and 82 are found by accepting a given percentage Ps 1 = Ps 2 of the simulations. 
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3.5 Comparisons between ABC and MCMC methods for a standard SIR 
model 

Following [3] a performance indicator for ABC techniques consists in their ability to replicate 
likelihood-based results given by MCMC. Here the situation is particularly favorable for com- 
paring the two methods since the partial and the full posterior are the same. In the following 
examples, we choose samples of small sizes (re = 3 and n = 29) so that the dimension of the 
missing data is reasonable and MCMC achieves fast convergence. For large sample sizes with 
high-dimensional missing data, MCMC convergence might indeed be a serious issue and more 
thorough updating scheme shall be implemented [9"1 124j. 

We consider the standard SIR model with no contact-tracing (A3 = 0). We choose gamma 
distributions for the priors of Ai and A2 with a shape parameters of 0.1 and rate parameters of 1 
and 0.1. The data consist of the detection times and we assume that the infection times are not 
observed. We implement the MCMC algorithm of [21] . A total of 10, 000 steps are considered 
for MCMC with an initial burn-in of 5, 000 steps. For ABC, the summary statistic consists of 
the cumulative number of detections as a function of time. A total of 100, 000 simulations are 
performed for ABC. 

The first example was previously considered by [21] . They simulated detection times by 
considering one initial infectious individual and by setting So = 9, A] = 0.12, and A2 = 1 (see 
Section 4 of the supplementary material for the data). As displayed by Figure [2J the posterior 
distributions obtained with ABC are extremely close to the ones obtained with MCMC provided 
that the tolerance rate is sufficiently small. We see that the tolerance rate changes importantly 
the posterior distribution obtained with ABC (see the posterior distributions for A2). 

In a second example, we simulate a standard SIR trajectory with Ai = 0.12, A2 = 1, So = 30 
and Jo = 1. The data now consist of 29 detection times (see Section 4 of the supplementary 
material). Once again, Figure [2] shows that the ABC and MCMC posteriors are close provided 
that the tolerance rate is small enough. ABC produces posterior distributions with larger tails 
compared to MCMC, even with the lowest tolerance rate of 0.1%. This can be explained by 
considering the extreme scenario in which the tolerance threshold 5 goes to infinity: every 
simulation has a weight of 1 so that ABC targets the prior instead of the posterior. As the prior 
has typically larger tails than the posterior, ABC inflates the posterior tails. 

4 Comparison between ABC with full and binned detection 
times 

When there is noise or when the detection times have been binned, the full observations (R\ , t € 
[0, T]) and (Rf,t € [0, T]) are unavailable. Then, we replace these summary statistics by a 
vector of summary statistics such as the numbers of detections per year during the observation 
period. Since these new summary statistics are not sufficient anymore, the new partial posterior 
distribution may be different from the posterior p(0\x). In the following, we compare point 
estimates and CIs obtained from ABC with full (method of Section [3j) and binned detection 
times (method of Section l4T2j) . 

4.1 A new set of summary statistics 

We consider a d-dimensional vector of summary statistics of three different types. First, we 
compute the numbers R\ and R\ of individuals detected by random screening and contact- 
tracing by the end of the observation period. Second, for each year j, we compute the numbers 
of individuals that have been found to be HIV positive R l j + \ — R l j, I = 1,2. The third type of 
summary statistics consists of the numbers of new infectious for each of the the sixth first years 
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Ij+i — Ij for j = 0, . . . , 5, as well as the mean time during which an individual is infected but 
has not been detected yet. This mean time corresponds to the mean sojourn time in the class 
I for the sixth first years. These summary statistics are available here because infectious times 
before 1992 are known. 

In order to compute the weights Wi, we consider the following spherical kernel K$(x) oc 
#(||H -1 :r||/J). Here K denotes the one-dimensional Epanechnikov kernel, |.| is the Euclidian 
norm of M. d and H" 1 a matrix. Because the summary statistics may span different scales, 
H is taken equal to the diagonal matrix with the standard deviation of each one-dimensional 
summary statistic on the diagonal. 

4.2 Curse of dimensionality and regression adjustments 

In the case of a d-dimensional vector of summary statistics, the estimator of the conditional mean 
()3.ip is convergent if the tolerance rate satisfies limjv-v+oo 5jy = 0, so that its bias converges to 
0, and liniAr_ s ._)_ 00 NSfj = +oo, so that its variance converges to [13] . As d increases, a larger 
tolerance threshold shall be chosen to keep the variance small. As a consequence, the bias 
may increase with the number of summary statistics. This phenomenon known as the curse 
of dimensionality may be an issue for the ABC-rejection approach. The following paragraph 
presents regression-based adjustments that cope with the curse of dimensionality. 

The adjustment principle is presented in a general setting within which the corrections of 
[3] and [6] can be derived. Correction adjustments aim at obtaining from a random couple 
(6i, Si) a random variable distributed according to p{9 \ s b s ). The idea is to construct a coupling 
between the distributions p(0\sj) and p(9\s b s ), through which we can shrink the 9^s to a sample 
of i.i.d. draws from p{9\s b s ). In the remaining of this subsection, we describe how to perform 
the corrections for each of the one-dimensional components separately. For 9 € R, correction 
adjustments are obtained by assuming a relationship 9 = G(s, e) =: G s (e) between the parameter 
and the summary statistics. Here G is a (possibly complicated) function and e is a random 
variable with a distribution that does not depend on s. A possibility is to choose G s = F~ , the 
(generalized) inverse of the cumulative distribution function of p(9\s). In this case, e = F s (9) is 
a uniform random variable on [0, 1]. The formula for adjustment is given by 



For G s = F~ , the fact that the 6*'s are i.i.d. with density p(9\s a b s ) arises from the stan- 
dard inversion algorithm. Of course, the function G shall be approximated in practice. As a 
consequence, the adjusted simulations 9*, i = 1,...,N, constitute an approximate sample of 
p(9 | s b s ). The ABC algorithm with regression adjustment can be described as follows 

1. Simulate, as in the rejection algorithm, a sample Sj), i = 1, . . . ,N. 

2. By making use of the sample of the (9i, Sj)'s weighted by the Wj's, approximate the function 
G such that B% = G(si,Ei) in the vicinity of s b s . 

3. Replace the 9^s by the adjusted 0*'s. The resulting weighted sample (#*, Wi), i = 1, . . . , N, 
form a sample from the target distribution. 

Beaumont et al. local linear regressions (LOCL) The case where G is approximated by a 
linear model G(s, e) = a + s*/3 + e, was considered by [3]. The parameters a and (3 are inferred 
by minimizing the weighted squared error J2^ = i Ks(si — s b s ){9i — (a + (sj — s b s ) T fi)) 2 ■ Using 
(|4.ip . the correction of [3] is derived as 
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Asymptotic consistency of the estimators of the partial posterior distribution with the correction 
P~2"j) is obtained by [5]. 

Blum and Francois' nonlinear conditional heteroscedastic regressions (NCH) To re- 
lax the assumptions of homoscedasticity and linearity inherent to local linear regression, [6] 
approximated G by G(s,e) = m(s) + cr(s) x e where m(s) denotes the conditional expectation, 
and ct 2 (s) the conditional variance. The estimators m and logo -2 are found by adjusting two 
feed-forward neural networks using a regularized weighted squared error. For the NCH model, 
parameter adjustment is performed as follows 

0* = rn(sobs) + (Pi ~ m{si)) * i=l,...,N. 

cr{Si) 

In practical applications of the NCH model, we train L = 10 neural networks for each condi- 
tional regression (expectation and variance) and we average the results of the L neural networks 
to provide the estimates m and logcr 2 . 

Reparameterization In both regression adjustment approaches, the regressions can be per- 
formed on transformations of the responses 9i rather that on the responses themselves. Pa- 
rameters whose prior distributions have finite supports are transformed via the logit function 
and non-negative parameters are transformed via the logarithm function. These transforma- 
tions guarantee that the #*'s lie in the support of the prior distribution and have the additional 
advantage of stabilizing the variance. 

4.3 A comparison for simulated datasets 

In order to work on data similar to the Cuban database, we simulate M = 200 synthetic data 
sets for the HIV-AIDS epidemic with ft=2x 10~ 6 , Ai = 1.14 x 10" 7 , A 2 = 3.75 x HT 1 , 
A3 = 6.55 x 10 -5 , and c = 1 [10j . The initial conditions are set to 5o = 6 x 10 6 , the size of the 
Cuban population in the age-group 15-49, Iq = 232 and Rq = |11| . Here we simulate only 6 
years of the epidemics. 

We study four variants of ABC for estimating Ai, A2, and A3: one with the two path-valued 
summary statistics and three with the vector of summary statistics. When using the finite di- 
mensional vector of summary statistics, we perform the smooth rejection approach as well as the 
LOCL and NCH corrections with a total of 21 summary statistics: the 18 summary statistics 
corresponding to the yearly increases of R , R 2 , and /; the final numbers of detected individuals 
Rq and Rq] and the mean sojourn time in the class I. Each of the M = 200 estimations of the 
partial posterior distributions are performed using a total of N = 5000 simulations of the SIR 
model with the mass action principle (first rate in (|2.ip ). 

Prior distributions The prior distributions for (n, Ai, A2 and A3 are chosen to be uniform on a 
log scale. The choice of a log scale reflects our uncertainty about the order of magnitude of the 
parameters. The prior distribution for log 10 (/ii) is U{— 6, —4) where U(a, b) denotes the uniform 
distribution on the interval (a, b). The prior distribution is U(—9, —6) for log 10 (Ai), IA{— 4, 3) for 
log 10 (A2), and U{— 8,2) for log 10 (A3). The bounds of the uniform distributions are set to keep 
the simulations from being degenerate. The prior for c is log(2)/W(l/12, 5) so that the half-life 
of t 1— > e~ c *, which measures the individual contribution to the detection by contact-tracing, is 
uniformly distributed between 1/12 and 5 years. 

Point estimates of 6 and credibility intervals Figure [3] displays the boxplots of the 200 
estimated modes, medians, 2.5% and 97.5% quantiles of the posterior distribution for Ai as a 
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function of the tolerance rate P$ (see Figures 1 and 2 of the supplementary material for A2 and 
A3). First, we find that the medians and modes are equivalent except for the rejection method 
with 21 summary statistics for which the mode is less biased. For the lowest tolerance rates, the 
point estimates obtained with the four possible methods are close to the value Ai = 1.14 x 1CT 7 
used in the simulations, with smaller CI for the LOCL and NCH variants. When increasing 
the tolerance rate, the bias of the point estimates obtained with the rejection method with 21 
summary statistics slightly increases. By contrast, up to tolerance rates smaller than 50%, the 
biases of the point estimates obtained with the three other methods remain small. As can be 
expected, the widths of the CI obtained with the rejection methods increase with the tolerance 
rate while they remain considerably less variable for the methods with regression adjustment. 

Mean square error For further comparison of the different methods, we compute the rescaled 
mean square errors (RMSEs). The RMSEs are computed on a log scale and rescaled by the 
range of the prior distribution so that 

1 " (log(A>)-log(A,-)) 2 

where A^ is a point estimate obtained with the k til synthetic data set. We find that the smallest 
values of the RMSEs are usually reached for the lowest value of the tolerance rate (see Figure 
H]). For Ai and A2, the RMSEs of the point estimates obtained with the four different methods 
are comparable for the lowest tolerance rate. However, the smallest values of the RMSEs are 
always found when performing the rejection method with the two sufficient summary statistics 
R 1 and R 2 . This finding is even more pronounced for the parameter A3. 

Rescaled mean credibility intervals To compare the whole posterior distributions obtained 
with the four different methods, we additionally compute the different CIs. The rescaled mean 
CI (RMCI) is defined as follows 

1 M \IC k \ 

RMCI=— V- ' j . tt .. . j = 1,2,3, (4.4) 

M Range(pnor(Aj)) 

where \ICj\ is the length of the k th estimated 95% CI for the parameter Aj. As displayed by 
Figure [3l the CIs obtained with smooth rejection increase importantly with the tolerance rate 
whereas such an important increase is not observed with regression adjustment. In Figure HI the 
CIs obtained with the NCH method are clearly the thinnest, those obtained with the rejection 
methods are the widest and those obtained with the LOCL method have intermediate width. 
In the following, we perform the NCH correction when considering the finite dimensional vector 
of summary statistics. This choice is motivated by the small RMSEs and RMCIs obtained with 
the NCH method (Figure 01 . 



5 Application to the Cuban HIV-AIDS epidemic 

We calibrate the SIR model with contact-tracing to the Cuban HIV-AIDS database. We consider 
two methods: smooth rejection ABC with the two path- valued summary statistics (Section [3]), 
and the NCH- ABC with the vector of summary statistics (Section 14. 2p . For the Cuban data, 
this vector is of dimension 51. 
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5.1 Parameter calibration and goodness of fit 



For the ABC algorithm, we perform a total of 100,000 simulations, we consider the two different 
rates of contact-tracing detection (|2.ip . and we use the same initial conditions and priors as 
in Section [4.31 To set the value of the tolerance rate P$, we consider the 15 first years of the 
epidemic as the training data and choose the value of the tolerance rate P$ that minimizes the 
prediction error at the end of the epidemic (T = 21.5) 



Pred Error = E 



1-^21.5 (-ftf) Robs, 21. 5I , 1-^21. 5(-^5) -^obs,21.5l 



pi 1 p2 

rl o6s,21.5 JX 6bB,2X.b 



(5.1) 



For the optimal tolerance rate P$, we investigate the goodness of fit of the SIR- type model. 
By simulating paths of the SIR model associated with parameters 9 sampled from the partial 
posterior distribution, we check if the model reproduces a posteriori the observed summary 
statistics |T6j. In Figure [5l we display the Posterior Predictive Distributions (PPD) of different 
summary statistics. Figure [5] has been obtained by considering the two sufficient summary 
statistics R 1 and R 2 , using the optimal tolerance rates of Pg 1 = P$ 2 = 1%, and considering 
the model of frequency dependence (second rate in (|2.ip ). The cumulated number of detected 
individuals are contained in the ranges of the PPDs. By contrast, the mean sojourn time in 
the class I is not contained in the PPD and the observed number of infectious individuals is 
in the lower tail of the PPD. An explanation might be that an age-structure has to be taken 
into account for the infection rate in order to capture the non-Markovian effects (e.g. [25] )■ A 
model with an increasing infection rate could diminish the mean sojourn time in the class I and 
increase by compensation the number of infections to maintain the infection pressure constant. 
When considering the model with a mass action principle (first rate in (|2.ip ). we observed (see 
Figure 3 of the supplementary material) that the statistic R^ bs 2 \ 5 is n °t contained in the PPD. 
With a mass action principle, the rate of contact-tracing detection increases linearly with the 
contribution of the detected individuals, and that is too rapid in comparison with the data. 
Last, we find that the PPDs obtained with the NCH method have extremely wide supports 
for both rates of contact-tracing detection (see Figure 4 of the supplementary material). These 
large PPDs suggest that the summary statistics measuring the detections and the infections may 
contain conflicting signals, which results in a large variance of the partial posterior distribution. 

To provide point estimates and CIs, we consider the model that provides the best fit (model 
with frequency dependence fitted with the two trajectories R 1 and R 2 ). For point estimation, we 
compute the posterior mode. The estimate of Ai is5.4xl0 -8 (95%CI [3.9 x 10~ 8 ; 2.3 x 10~ 7 ]), the 
estimate of A 2 is 0.13 (95%CI [0.007; 1.17]), and the estimate of A 3 is 0.19 (95%CI [0.03; 0.82]). 
The point estimate of the rate of infection Ai implies that the net rate of infection per infectious 
individuals \\S is equal to 0.32 (95%CI [0.23; 1.37]). This means that the waiting time before 
an infectious individual, that has not been detected yet, infects an other individual is 3.1 years 
(95%CI [0.72; 4.34]). 



5.2 The dynamic of the Cuban HIV-AIDS epidemic 

Reconstruction of the cumulative numbers of detections Figure [6] displays the dynam- 
ics predicted by the SIR model for the numbers of individuals detected by random screening, 
contact-tracing and for the number of unknown infectious individuals. Interestingly, there is 
a really good fit between the real and predicted numbers of individuals detected by random 
screening except between 1992 and 1995. This period corresponds to the period of crisis that 
followed the collapse of the Soviet Union and during which the HIV detection system received 
less attention. We also find a slight discrepancy in the recent years (2000-2007) between the 
real and predicted numbers of detections by contact-tracing, which may reveal a weakening in 
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the contact-tracing system. An explanation is that a new mode of detection, related to contact- 
tracing but still counted as random detection, has been developed. This new type of detection 
is promoted by the family doctors who ask to their patients the names of individuals at risk (H. 
De Arazoza, personal communication). 

Performance of the contact-tracing system When testing the performance of contact- 
tracing, [18] computed the coverage of the epidemic defined as the percentage of infectious 
individuals that have been detected (R 1 + R 2 )/(I + R 1 + R 2 ). As displayed by Figure El the 
SIR model predicts a coverage of 62% (95%CI [36%; 66%]) in 2000 that is much lower than a 
coverage of 83% (95%CI [75%; 87%]) as inferred by [IS]. However, since the PPDs of Figure [S] 
show that the SIR model predicts less infectious individuals than observed, the coverage might 
still be overestimated and would consequently be even smaller than 62%. 

Using this estimation of the coverage, we can compare the rates of detection by random 
screening and contact-tracing per infectious individual. The estimated per capita rate of ran- 
dom screening is A2 = 0.13. The per capita rate of contact-tracing equals A3 YlieR ex P( — ~~ 
Ti))/(It + SiGK ex P( — ~~ ^i)))- Using a zero-order expansion, we find that this rate can be 
approximated by the product of A3 with the coverage of the epidemic. Hence, the per capita 
rate of contact-tracing can be estimated as 0.19 x 0.62 ~ 0.12 that is almost equal to A2- 

Predictions Additionally, simulations of the SIR model provide predictions for the evolu- 
tion of the HIV dynamic in the forthcoming years. The SIR model predicts that in 2015, 
42,000 (95%CI [29, 000; 107, 000]) individuals will be infected since the beginning of the epi- 
demic in Cuba. Among these infected individuals, a proportion of 45% (95%CI [29%; 46%]) 
will be detected by random screening and a proportion of 21% (95%CI [10%; 22%]) by contact- 
tracing. As displayed by Figure (U the SIR-type model with contact-tracing predicts that the 
total proportion of individuals detected by contact-tracing will reach an asymptote of 32% 
(95%CI [25%; 33%]) in 2015. The total number of infected individuals in 2015 corresponds to 
27, 000 (95%CI [19, 000; 80, 000]) new cases of HIV between July 2007 and January 2015. In the 
same period of time, the SIR model predicts that 12,000 individuals (95%CI [9, 000; 24, 000]) 
will be detected by random screening and 6, 000 individuals (95%CI [4, 000; 8, 000]) by contact- 
tracing. 

6 Conclusions 

In the context of temporal epidemiological data, ABC techniques can provide accurate estimates 
of the parameters of interest such as the infection and detection rates \19\ I28|. ABC relies on 
simulations of the model and can therefore be applied to various epidemiological models. 

For partially observed population and missing infectious times, MCMC methods require to 
reconstruct the unknown data which can be highly computationally intensive for large popu- 
lations. For instance, [21] and [25] considered MCMC algorithms for populations consisting of 
about 100 individuals whereas the Cuban HIV-AIDS database contains almost 10,000 known 
HIV positive individuals, which makes the total (known and unknown) number of infectious 
individuals even larger. When the dimension of the missing data, the infection times here, is 
both large and unknown, data imputation with MCMC can be computationally very demanding 
and takes several days on a parallelized system |24j . 

However, compared with the abundant MCMC literature, the experience of statisticians with 
ABC is still rather limited. For MCMC algorithms, theoretical convergence results [26] as well as 
practical methods for monitoring numerical convergence are available Q2]. Although theoretical 
results are now available for ABC [5] , there is no method for evaluating the two approximations 
inherent to ABC (Section 13. 1|) . Here, we are in a favorable situation since the full summary 
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statistics are sufficient so that the partial and the full posterior are the same. However, for 
the second approximation, the practice of conditional density estimation in high dimensions 
remains an issue. When comparing posterior distributions obtained with MCMC and ABC for 
a standard SIR model, we find the same modes provided that the tolerance rate is small enough. 
However, even for the smallest tolerance rate, we find that ABC generates posterior distributions 
with larger tails compared with MCMC. More generally, ABC applications have been restricted 
to models with moderate number of parameters whereas MCMC applications can involve a 
very large number of parameters ([22]). For models with a substantial numbers of parameters, 
adaptive ABC algorithms that use the simulations to modify the sampling distribution of the 
parameter 0, might constitute interesting ways to explore for the future of ABC in epidemiology 

[22 El E7]. 

In this paper, we consider both finite and infinite dimensional summary statistics for ABC. 
When comparing ABC with the two different sets of statistics, we find that the point estimates 
of the parameters Ai, A2, A3, with the smallest quadratic errors are obtained with the rejection 
method based on the infinite-dimensional statistics. However, the 95% CIs obtained with this 
method are large and critically depend on the tolerance rate. By contrast, regression-based 
adjustment methods, and the NCH method more particularly, considerably shorten the CIs and 
are less sensitive to the tolerance rate. Applications of regression-based ABC methods constitute 
therefore a solution for 'stabilizing' the CIs. However, no ABC with regression adjustment have 
been developed so far for infinite-dimensional summary statistics. 

In the last section of the paper, we calibrate the SIR model to the Cuban HIV- AIDS data. By 
comparing the posterior predictive distributions of the total number of detections, we find that 
the model with a frequency-dependent rate of contact-tracing provides the best fit to the data. 
Using this model, we compare the present-day rates of contact-tracing and random screening. 
We find that they are almost the same and equal to 0.13/individual/year. Converting rates of 
detection to waiting times before detections, we find that the waiting time before an individual 
infected today will be detected is equal to 1/(2 x 0.13) « 3.8 years. At the time of detection, 
both types of detection are equally probable. Although it suggests that contact-tracing detection 
contributes importantly to HIV screening in Cuba, we find that the screening might have been 
largely incomplete. The percentage of undetected individuals among the infectious individuals 
might have been underestimated [18] and would be of the order of 40%. 
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Figure 1: Schematic description of the SIR model with contact-tracing. 
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a) 




b) 




Figure 2: Comparison of the posterior densities obtained with MCMC and ABC. The vertical 
lines correspond to the values of the parameters used for generating the synthetic data, a) The 
data consist of 3 detection times that have been simulated by [21]. b) The data consist of 29 
detection times that we simulated by setting Ai = 0.12, A2 = 1, Sq = 30, Iq = 1, and T = 5 (see 
the supplementary material for the 29 detection times). 
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Figure 3: Boxplots of the M = 200 estimated modes and quantiles (2.5%, 50%, and 97.5%) of 
the partial posterior distributions of Ai- For each ABC method and each value of the tolerance 
rate, 200 posterior distributions are computed for each of the 200 synthetic data sets. The 
horizontal lines correspond to the true value Ai = 1.14 x 10 -7 used when simulating the 200 
synthetic data sets. The different tolerance rates are 0.01, 0.05, 0.10, 0.25, 0.50, 0.50, 0.75, and 
1 for all the ABC methods except the rejection scheme with the two summary statistics. For the 
latter method, the tolerance rates are 0.007, 0.02, 0.06, 0.13, 0.27, 0.37, 0.45, 0.53, 0.66, 0.80, 1. 
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Figure 4: Rescaled mean squared error (RMSE) of the posterior mode and rescaled mean cred- 
ibility interval (RMCI). 
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Figure 5: Bayesian posterior predictive distributions of 5 , -R| L5 , Iq, and the mean sojourn 
time in the class I. The SIR model corresponds to the model with frequency dependence for 
contact-tracing detection. The partial posterior samples are obtained with the smooth rejection 
ABC algorithm by making use of the 2 infinite-dimensional summary statistics R 1 and R 2 . 
Tolerance rate of Pg 1 = P$ 2 = 1% are considered for each summary statistic. 
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Figure 6: Median and 95% credibility intervals of the posterior predictive distributions of Rj, 
Rf, R\/(R} + R%) and coverage Rt/{h + Rt) from 1986 to 2015. The coverage is defined as 
the proportion of known HIV positive individuals. The posterior samples are generated by the 
rejection scheme with the two summary statistics. A tolerance rate of P$ = 1% is considered for 
each summary statistic. 
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